Inference for a difference between proportions

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • Quantifying the uncertainty for the difference between two sample proportions from independent samples
  • How to construct and interpret a confidence interval for the difference between two population proportions
  • Difference between two independent proportions ⇝ the odds ratio
  • How to use R to construct a confidence interval for the odds ratio
  • How to interpret a confidence interval for the odds ratio

For how to conduct and interpret a hypothesis test for the difference between two population proportions, see pages 477–480 of Lock, R. H., Lock, P. F., Morgan, K. L., Lock, E. F., & Lock, D. F. (2021). Statistics: Unlocking the power of data. Wiley.

\(\hat{p}_1 - \hat{p}_2\) from two independent samples

CS 1.6 revisited: ICU admissions

Data from a sample of 200 patients following admission to an adult intensive care unit (ICU) in the United States of America.

Variables
Status A factor denoting whether the patient lived or died
Sex A factor denoting the patient’s sex, male or female
icu.df <- read.csv("datasets/ICU.csv")
nrow(icu.df)
[1] 200
# Two-way table of counts/frequencies
xtabs( ~ Sex + Status, data = icu.df) |>
  addmargins()
        Status
Sex      died lived Sum
  female   16    60  76
  male     24   100 124
  Sum      40   160 200
# Two-way table of proportions
xtabs( ~ Sex + Status, data = icu.df) |>
  proportions("Sex") |>
  round(2)
        Status
Sex      died lived
  female 0.21  0.79
  male   0.19  0.81

CS 1.6 revisited: ICU admissions

xtabs( ~ Sex + Status, data = icu.df) |>
  as.data.frame() |>
  barchart(Freq ~ Status, groups = Sex, data = _, origin = 0,
           main = "Status distribution by Sex", 
           xlab = "Status", ylab = "Count",
           auto.key = list(title = "Sex", space = "right"))

Figure: The ICU patient distribution of Status by Sex

xtabs( ~ Sex + Status, data = icu.df) |>
  proportions("Sex") |>
  as.data.frame() |>
  barchart(Freq ~ Status, groups = Sex, data = _, origin = 0, 
           main = "Status distribution by Sex",
           xlab = "Status", ylab = "Proportion",
           auto.key = list(title = "Sex", space = "right"))

Figure: The ICU patient distribution of Status by Sex, as proportions within each Sex

Simulating random samples of two populations

Sampling distribution of \(\hat{p}_1 - \hat{p}_2\)

Suppose both population proportions, \(p_1\) and \(p_2\), are known. These are the ground “truths” (parameters) that summarise all possible values we could observe.

The sampling distribution of the difference between two sample proportions, \(\hat{p}_1 - \hat{p}_2\), is

\[ \hat{p}_1 - \hat{p}_2 ~ \text{approx.} ~ \text{Normal} \! \left( \begin{array}{l} \mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2, \\ \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1\times(1-p_1)}{n_1} + \frac{p_2\times(1-p_2)}{n_2}} \end{array} \right) \]

The use of the \(\hat{p}_1 - \hat{p}_2\) subscripts is to make it clear that we are talking about the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) and not the possible values we could observe
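The approximation above can be checked with a small simulation. This is a minimal sketch, assuming hypothetical "true" proportions and sample sizes (`p1`, `p2`, `n1`, `n2` are made up for illustration and do not come from any case study); it repeatedly draws two independent samples and compares the mean and standard deviation of the simulated differences with the formulas.

```r
set.seed(121)                 # arbitrary seed so the sketch is reproducible

# Hypothetical population proportions and sample sizes (assumptions)
p1 <- 0.5; n1 <- 80
p2 <- 0.3; n2 <- 120
n.sims <- 10000               # number of simulated pairs of samples

# Each simulated sample proportion is a binomial count divided by n
phat1 <- rbinom(n.sims, size = n1, prob = p1) / n1
phat2 <- rbinom(n.sims, size = n2, prob = p2) / n2
diffs <- phat1 - phat2

# Theoretical mean and standard deviation of phat1 - phat2
theory.mean <- p1 - p2
theory.sd <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Simulated values should be close to the theoretical ones
c(sim.mean = mean(diffs), theory.mean = theory.mean)
c(sim.sd = sd(diffs), theory.sd = theory.sd)
```

Calling `hist(diffs)` afterwards should show a roughly bell-shaped histogram centred near \(p_1 - p_2\).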

Assumptions for inference on \(p_1 - p_2\)

  1. Two independent groups
  2. Within group: Independent observations
  3. Comparing the same level of interest for each group
  4. The following heuristics have to be met:
  • At least ten “yes” values and at least ten “no” values in the first group
    \(n_1 \times \hat{p}_1 \geq 10\) and \(n_1 \times (1 - \hat{p}_1) \geq 10\)
  • At least ten “yes” values and at least ten “no” values in the second group
    \(n_2 \times \hat{p}_2 \geq 10\) and \(n_2 \times (1 - \hat{p}_2) \geq 10\)

More on 4.

These heuristics are a consequence of relying only on the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) for the method taught in DATAX121

Definition: \(\text{se}(\hat{p}_1 - \hat{p}_2)\)

The standard error of the difference between two sample proportions, \(\hat{p}_1 - \hat{p}_2\), is

\[ \text{se}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1\times(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2\times(1-\hat{p}_2)}{n_2}} \]

where:

  • \(\hat{p}_1\) is the sample proportion for the level of interest in the first group
  • \(\hat{p}_2\) is the sample proportion for the same level of interest in the second group
  • \(n_1\) is the number of observations in the first group
  • \(n_2\) is the number of observations in the second group

CS 8.1: Production of wood pellets

The production of wood pellets often goes “out of specification”. To improve the number of pellets that conform to specifications, the manufacturer experimented with two new methods of producing pellets and randomly sampled 100 pellets produced with each method.

38 out of 100 sampled pellets produced with Method A conformed to specifications, while 29 out of 100 sampled pellets produced with Method B conformed to specifications.

Variables
Count An integer denoting the number of pellets within the group
Conform A factor denoting whether the group of pellets conformed to specifications, Yes or No
Method A factor denoting the method used to manufacture the group of pellets, A or B
# Read in the data
pellets.df <- read.csv("datasets/pellets-grouped.csv")

# Take a quick peek
head(pellets.df)
  Count Conform Method
1    38     Yes      A
2    62      No      A
3    29     Yes      B
4    71      No      B

CS 8.1: Production of wood pellets

Can we trust a confidence interval for the difference between two underlying proportions?

\(\text{se}(\hat{p}_A - \hat{p}_B) = 0.0664 ~ (4 ~ \text{dp})\)
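The heuristic check and the standard error quoted above can be reproduced from the counts given in the case study. A minimal sketch; the object names (`x`, `n`, `phat`, `se.diff`) are chosen here for illustration only.

```r
# Counts of conforming pellets and sample sizes, from the case study
x <- c(A = 38, B = 29)
n <- c(A = 100, B = 100)

# Heuristic: at least ten "yes" and ten "no" values in each group
rbind(yes = x, no = n - x)
heuristic.met <- all(x >= 10, n - x >= 10)
heuristic.met

# Standard error of the difference between the two sample proportions
phat <- x / n
se.diff <- sqrt(sum(phat * (1 - phat) / n))
round(se.diff, 4)
```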

barchart(Count ~ Conform, data = pellets.df, groups = Method,
         origin = 0, xlab = "Conformed?", 
         auto.key = list(title = "Method", space = "right"),
         main = "Distribution of conformation by Method")

Figure: The classification of whether the 200 sampled wood pellets conformed, grouped by the production method

A confidence interval for \(p_1 - p_2\)

Definition: \(100(1 - \alpha)\)% Confidence interval for \(p_1 - p_2\)

\[ \hat{p}_1 - \hat{p}_2 \pm z^*_{1-\alpha/2} \times \text{se}(\hat{p}_1 - \hat{p}_2) \]

where:

  • \(\hat{p}_1\) is the sample proportion for the level of interest in the first group
  • \(\hat{p}_2\) is the sample proportion for the same level of interest in the second group
  • \(n_1\) is the number of observations in the first group
  • \(n_2\) is the number of observations in the second group
  • The confidence level is \((1 - \alpha)\), where \(\alpha\) is a proportion
  • \(z^*_{1-\alpha/2}\) is the z-multiplier for the prescribed confidence level of \((1 - \alpha)\)
  • \(\text{se}(\hat{p}_1 - \hat{p}_2)\) is the standard error of \(\hat{p}_1 - \hat{p}_2\)—see Slide 9

CS 8.1: Production of wood pellets

Recall that 38 out of 100 sampled pellets produced with Method A conformed to specifications, while 29 out of 100 sampled pellets produced with Method B conformed to specifications

Construct a 99% confidence interval for \(p_A - p_B\).

\[ \text{se}(\hat{p}_A - \hat{p}_B) = 0.0664 ~ (4 ~ \text{dp}) \]

The solution is (-0.08115218, 0.26115218)

# The R function to find the z-multiplier
qnorm(0.995)
[1] 2.575829
# The R function to calculate it in one go
prop.test(x = c(38, 29), n = c(100, 100), correct = FALSE,
          conf.level = 0.99)

    2-sample test for equality of proportions without continuity correction

data:  c(38, 29) out of c(100, 100)
X-squared = 1.818, df = 1, p-value = 0.1776
alternative hypothesis: two.sided
99 percent confidence interval:
 -0.08115218  0.26115218
sample estimates:
prop 1 prop 2 
  0.38   0.29 
# If we use pellets.df, must "rotate" it first
pellets.df$Conform <- factor(pellets.df$Conform, 
                             levels = c("Yes", "No"))

# Unlike t.test(), prop.test() doesn't have a data argument
xtabs(Count ~ Method + Conform, data = pellets.df) |>
  prop.test(correct = FALSE, conf.level = 0.99)
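The same 99% interval can also be built step by step from the definition, which makes each ingredient visible. A minimal sketch using the counts from the case study; the object names are illustrative.

```r
# Sample proportions of conforming pellets for Methods A and B
phat.A <- 38 / 100
phat.B <- 29 / 100

# Point estimate and its standard error
est <- phat.A - phat.B
se.diff <- sqrt(phat.A * (1 - phat.A) / 100 + phat.B * (1 - phat.B) / 100)

# z-multiplier for a 99% confidence level (alpha = 0.01)
z.star <- qnorm(1 - 0.01 / 2)

# Estimate plus or minus the margin of error
ci <- est + c(-1, 1) * z.star * se.diff
ci
```

The two limits should agree with the interval reported by `prop.test()` when `correct = FALSE`.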

Interpretation of a confidence interval for \(p_1 - p_2\)

For CS 8.1, the 99% confidence interval for the difference in the two methods’ underlying proportions was (-0.08115218, 0.26115218)

Equation Reference: On the topic of independence…

Suppose you want to compare proportions within the same sample

\[ \text{se}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2}{n}} \]

— Wild & Seber (2000)

The odds ratio

Another statistic that summarises a dataset with two categorical variables.

Odds

The odds of an event compare the chance that the event happens to the chance that it does not. Odds are typically expressed using a phrase with the structure “a to b”, so a ratio is implied but not actually computed.

— Utts & Heckard (2015)

Suppose a sample contains 1000 individuals, of which 400 carry the gene for a disease

In general, the higher the odds, the more likely the event is to happen

Definition: Odds

\[ \widehat{\text{Odds}} = \frac{\hat{p}}{1-\hat{p}} \]

where:

  • \(\widehat{\text{Odds}}\) is short for the observed odds of the event
  • \(\hat{p}\) is the sample proportion of the event
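As a quick worked example of the definition, the odds for the sample of 1000 individuals with 400 gene carriers can be computed as follows. A minimal sketch; the object names are illustrative.

```r
# Sample of 1000 individuals, 400 of whom carry the gene
n <- 1000
carriers <- 400

phat <- carriers / n          # sample proportion of carriers, 0.4
odds <- phat / (1 - phat)     # observed odds of carrying the gene
odds
```

The result is 0.4 / 0.6, or about 0.67, which corresponds to odds of "2 to 3" for carrying the gene.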

Odds ratio

The odds ratio is more suitable than the difference between two proportions when the two-way table summarises the outcome of an event. It compares the odds of an event for two different “groups”, e.g. ethnicities and regions (Utts & Heckard, 2015).

The odds values for the two categories being compared are computed as ratios, allowing us to describe how much more likely an event is in the first group compared to the second group

Suppose a sample contains 400 individuals from region A, of which 200 carry the gene for a disease and 600 individuals from region B, of which 200 carry the same gene for a disease.

Definition: Odds ratio

\[ \widehat{\text{OR}} = \frac{\widehat{\text{Odds}}_1}{\widehat{\text{Odds}}_2} = \frac{\hat{p}_1 \times (1- \hat{p}_2)}{(1- \hat{p}_1) \times \hat{p}_2} \]

where:

  • \(\widehat{\text{OR}}\) is short for the observed odds ratio of the event between the first and second groups
  • \(\widehat{\text{Odds}}_1\) is short for the observed odds of the event for the first group
  • \(\widehat{\text{Odds}}_2\) is short for the observed odds of the event for the second group
  • \(\hat{p}_1\) is the sample proportion of the event for the first group
  • \(\hat{p}_2\) is the sample proportion of the event for the second group
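Applying the definition to the two-region example (200 of 400 carriers in region A, 200 of 600 in region B) gives the following. A minimal sketch; the object names are illustrative, and region A is taken as the first group.

```r
# Sample proportions of gene carriers in each region
phat.A <- 200 / 400
phat.B <- 200 / 600

# Observed odds of carrying the gene in each region
odds.A <- phat.A / (1 - phat.A)   # 200 carriers to 200 non-carriers
odds.B <- phat.B / (1 - phat.B)   # 200 carriers to 400 non-carriers

# Observed odds ratio, region A compared to region B
or.hat <- odds.A / odds.B
or.hat

# The second form of the definition gives the same value
or.alt <- (phat.A * (1 - phat.B)) / ((1 - phat.A) * phat.B)
```

Here the odds are 1 in region A and 0.5 in region B, so the observed odds of carrying the gene in region A are twice those in region B.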

On the interpretation of an odds ratio

  • If \(\widehat{\text{OR}} = 1\), the odds of the event are the same for both groups
  • If \(\widehat{\text{OR}} > 1\), the odds of the event are higher in the first group compared to the second group
  • If \(\widehat{\text{OR}} < 1\), the odds of the event are lower in the first group compared to the second group

Assumptions for inference on OR

  1. Independent observations—typically met with random samples or randomisation of the data collection order with randomised experiments
  2. The collected data has “information” about the odds of an event

More on 1.

Inference on an odds ratio, OR, is more flexible than the method taught to infer \(p_1 - p_2\), as it is built on a formal statistical model (see DATAX221).

CS 1.6 revisited: ICU admissions

Recall that the data came from a sample of 200 patients.

Do we meet the assumptions for inference on an OR?

What are the potential consequences for not meeting the independence assumption?

Observed odds ratio: \(\widehat{\text{OR}} = \frac{16 \times 100}{60 \times 24} = 1.1111111\)

# Two-way table of counts/frequencies
xtabs( ~ Sex + Status, data = icu.df) |>
  addmargins()
        Status
Sex      died lived Sum
  female   16    60  76
  male     24   100 124
  Sum      40   160 200

A confidence interval for OR

In DATAX121, we will only focus on how to use R to construct the interval and how to interpret such an interval

CS 1.6 revisited: ICU admissions

# For the first time in DATAX121, we will load another R package
library(epitools)
# Check that the "event" is along the columns of the two-way table
icu.tab <- xtabs( ~ Sex + Status, data = icu.df)
icu.tab
        Status
Sex      died lived
  female   16    60
  male     24   100
# The following function from epitools calculates the 95% C.I. for OR
oddsratio.wald(icu.tab, conf.level = 0.95)
$data
        Status
Sex      died lived Total
  female   16    60    76
  male     24   100   124
  Total    40   160   200

$measure
        odds ratio with 95% C.I.
Sex      estimate     lower    upper
  female 1.000000        NA       NA
  male   1.111111 0.5468526 2.257588

$p.value
        two-sided
Sex      midp.exact fisher.exact chi.square
  female         NA           NA         NA
  male    0.7687874    0.8558382  0.7707773

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
  • If library(epitools) causes an error, see Workshop 8 (when available)
  • The epitools package expects the data to be in the form of a two-way table—see T01: Summarising Data, Slides 44–49
  • The numerator odds of the odds ratio is printed as the first line of the $measure R output

Interpretation of a confidence interval for OR

The 95% confidence interval for the odds ratio of the event, “patient died after admission to ICU”, for females compared to males was (0.5468526, 2.257588)

Note that the 95% confidence interval is asymmetrical about \(\widehat{\text{OR}}\)

Reference: \(100(1 - \alpha)\)% Confidence interval for OR

This C.I. method is specifically for an odds ratio from a 2-by-2 table of counts

Two-way table

\[ \left[ \begin{array}{cc} a & b \\ c & d \end{array} \right] \]

\[ \log\left(\widehat{OR}\right) \pm z^*_{1-\alpha/2} \times \text{se}\left\{\log\left(\widehat{OR}\right)\right\} \]

where:

  • \(\log\left(\widehat{OR}\right) = \log\left(\frac{a \times d}{b \times c}\right)\) is the natural logarithm of the estimated odds ratio
  • The confidence level is \((1 - \alpha)\), where \(\alpha\) is a proportion
  • \(z^*_{1-\alpha/2}\) is the z-multiplier for the prescribed confidence level of \((1 - \alpha)\)
  • \(\text{se}\left\{\log\left(\widehat{OR}\right)\right\} = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\) is the standard error of \(\log\left(\widehat{OR}\right)\)
  • To get in terms of an odds ratio, we exponentiate the lower and upper limits of the confidence interval
    • That is, \(\left[\exp(\text{Lower Limit}), \, \exp(\text{Upper Limit})\right]\)
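The steps above can be sketched in base R using the ICU two-way table of counts (females: 16 died, 60 lived; males: 24 died, 100 lived). This is an illustrative by-hand calculation, not the internals of any package; the object names are chosen here.

```r
# 2-by-2 table of counts laid out as [a b; c d]
icu.counts <- matrix(c(16, 60,
                       24, 100), nrow = 2, byrow = TRUE,
                     dimnames = list(Sex = c("female", "male"),
                                     Status = c("died", "lived")))

# Natural log of the estimated odds ratio, log(a*d / (b*c))
log.or <- log(icu.counts[1, 1] * icu.counts[2, 2] /
                (icu.counts[1, 2] * icu.counts[2, 1]))

# Standard error: square root of the sum of reciprocal counts
se.log.or <- sqrt(sum(1 / icu.counts))

# 95% interval on the log scale, then exponentiate both limits
z.star <- qnorm(1 - 0.05 / 2)
or.ci <- exp(log.or + c(-1, 1) * z.star * se.log.or)
or.ci
```

The limits should agree with the 95% C.I. of (0.5468526, 2.257588) quoted for the ICU data. Because the interval is symmetric on the log scale, it is asymmetric about the estimated odds ratio after exponentiating.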

Non-examinable

Exemplars

The following exemplars only have a context and a C.I. interpretation

E 8.2: Crows Never Forget a Face

Biologists studying crows will capture a crow, tag it, and release it. These crows seem to remember the scientists who caught them and will scold them later. A study to examine this effect with caveman masks found that crows scolded a person wearing a caveman mask in 158 out of 444 encounters with crows, whereas crows scolded a person in a neutral mask in 109 out of 922 encounters.

Let \(p_c\) be the proportion of scoldings when volunteers are wearing the caveman mask and \(p_b\) be the proportion of scoldings when volunteers are wearing the neutral mask

If we construct a 90% confidence interval for \(p_c - p_b\), we get \((0.197, 0.279)\)

We are 90% sure that the proportion of crows that will scold is between 0.197 and 0.279 higher if the volunteer is wearing the caveman mask than if he or she is wearing the neutral mask.

E 8.3: Marijuana use among high school students

A survey of students in their final year of high school asked whether they had ever used marijuana. It was found that 515 out of 1146 males had used marijuana, and 445 out of 1120 females had used marijuana. A health researcher wanted to use this data to infer the odds ratio of marijuana use between males and females.

Let \(\text{Odds}_M\) be the odds that a male had used marijuana and \(\text{Odds}_F\) be the odds that a female had used marijuana

If we construct a 95% confidence interval for \(\text{OR} = \displaystyle \frac{\text{Odds}_M}{\text{Odds}_F}\), we get \((1.0477, 1.4629)\)

With 95% confidence, we estimate that the odds that a male had used marijuana are somewhere between 1.05 and 1.46 times the odds that a female had used marijuana
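For readers who want to check the interval, the calculation can be wrapped in a small helper. This is a sketch only: `or.wald.ci()` is a hypothetical function written for this example (it is not part of base R or any package), and it assumes a 2-by-2 table with the first group in the first row and the event in the first column.

```r
# Hypothetical helper: Wald-type C.I. for the odds ratio of a 2-by-2 table
or.wald.ci <- function(counts, conf.level = 0.95) {
  stopifnot(all(dim(counts) == c(2, 2)), all(counts > 0))
  log.or <- log(counts[1, 1] * counts[2, 2] / (counts[1, 2] * counts[2, 1]))
  se.log.or <- sqrt(sum(1 / counts))
  z.star <- qnorm(1 - (1 - conf.level) / 2)
  exp(log.or + c(lower = -1, upper = 1) * z.star * se.log.or)
}

# Rows: males, females; columns: used marijuana, never used
# Never-used counts are 1146 - 515 = 631 and 1120 - 445 = 675
survey.counts <- matrix(c(515, 631,
                          445, 675), nrow = 2, byrow = TRUE)

survey.ci <- or.wald.ci(survey.counts, conf.level = 0.95)
round(survey.ci, 4)
```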